Although sampling has been mentioned as part of the chance and data 
component of the mathematics curriculum since about 1990, little research 
attention has been aimed specifically at school students' understanding of this 
descriptive area. This study considers the initial understanding of bias in 
sampling by 639 students in grades 3, 5, 7, and 9. Three hundred and forty-one 
of these students then undertook a series of lessons on chance and data with an 
emphasis on chance, data handling, sampling, and variation. A post-test was 
administered to 285 of these students and two years later all available students 
from the original group (328) were again tested. This study considers the initial 
level of understanding of students, the nature of the lessons undertaken at each 
grade level, the post-instruction performance of those who undertook lessons, 
and the longitudinal performance after two years of all available students. 
Overall, instruction was associated with improved performance, which was 
retained over two years, but there was little difference between those who had 
and those who had not experienced instruction. Results for specific grades, some 
of which went against the overall trend, are discussed, as well as educational 
implications for the teaching of sampling across the years of schooling based on 
the classroom observations and the changes observed. 


Traditionally sampling was of minimal interest in statistics courses. 
Assumptions were made about random samples from normal distributions 
and then interest turned to hypothesis testing and confidence intervals. 
Sampling was more the domain of experimental scientists and social 
researchers. The aim was to be sure that the sample collected satisfied the 
criteria to hand the data over to the statisticians or computer programs to 
churn out statistics and p-values. The school curriculum reflected this 
approach in introducing students to the arithmetic mean in the middle years 
and the standard deviation, permutations, and combinations in the senior 
years. The advent of exploratory data analysis and its influence on the school 
curriculum since the National Council of Teachers of Mathematics' Standards 
(1989) have brought sampling to the forefront of the chance and data part of 
the mathematics curriculum. This perspective is reflected in A National 
Statement on Mathematics for Australian Schools (Australian Education 
Council [AEC], 1991) in band B for upper primary students with six types of 
activities to enable students to "understand what samples are, select 
appropriate samples from specified groups and draw informal inferences 
from data collected" (p. 172). Students are now expected to collect their own 
samples, explore the implications using descriptive statistics, and make 
judgements about claims, long before they are introduced to formal statistics 
at the senior secondary level. It is hence important today for students to 
develop an appreciation of what sampling entails and to appreciate the 
similarities and distinctions between a statistical sample and a sample of 
food handed out in the supermarket. 

In the classroom, however, anecdotal evidence suggests that sampling 
only gets passing mention. Unlike most parts of the mathematics curriculum, 
which require calculations to come up with specific answers, sampling is a 
topic described in words. Test questions would require answering in words 
not numbers and this sort of question is often unpopular with students and 
teachers alike. Sampling is more like a topic one would expect to find in a 
science course. It is to be hoped that some of the moves towards quantitative 
literacy (e.g., Madison & Steen, 2003) will help change these attitudes in the 
classroom and topics such as sampling will receive increased attention. In the 
transition time, research can address the issues of student understanding and 
ability to learn. 


Previous Research 

The early research with respect to understanding of sampling was related to 
the influence of sample size on decision-making. Tversky and Kahneman 
(1971) began with a study of college students and suggested that there was a 
tendency for them to believe that a sample, no matter how small, should 
represent the population exactly. They coined the term representativeness 
heuristic for this belief (Kahneman & Tversky, 1972) and spawned many 
studies related to judgments in situations of uncertainty. Although there has 
been some controversy about the complexity of the problems set by Tversky 
and Kahneman (Evans & Dusoir, 1977; Gigerenzer & Hoffrage, 1995) and in 
relation to whether questions are asked based on the centre of the 
distribution or the tail (Well, Pollatsek, & Boyce, 1990), it is generally 
acknowledged that issues of sample size and representativeness are 
important, particularly for students younger than those involved in these 
early studies. 

Interest in school students' developing ideas of sampling has been 
considered from several angles, depending on the connections to other 
aspects of the chance and data curriculum that are considered important by 
researchers. Fischbein and Schnarch (1997), for example, used problems 
directly from Tversky and Kahneman (1971) with school students, whereas 
Estepa, Batanero, and Sanchez (1999) gave students specific pairs of data sets 
to compare as samples, and Reading and Shaughnessy (2000) asked students 
to imagine sampling in a probability setting. The specific issue of variability 
in sampling was considered for primary students by Wagner and Gal (1991) 
in the context of comparing two data sets of equal or unequal size; they 
found a dilemma for students between belief in homogeneity and 
anticipated variation. Rubin, Bruce, and Tenney (1991) found a similar 
tension for senior high school students in wanting both variation and 
representativeness in samples. Metz (1999), who worked with primary 
students designing their own science experiments, observed a range of 
beliefs about sampling from "it really is not important" to "it is necessary to 
measure an entire population in order to reach decisions." In a more 
comprehensive analysis of these data (Metz, 2004), based on students' 
conceptualisations of uncertainty in scientific investigations, issues related to 
sampling were often discussed in relation to students' explanations for the 
uncertainty. The connection of representativeness to average was 
highlighted in the work of Mokros and Russell (1995), but not explicitly tied 
back to samples and sampling. 

If representativeness is a quality to be sought in sampling, then bias is 
the other extreme and is to be avoided. Less research has focused specifically 
on this aspect. Jacobs (1997, 1999) worked with primary children and found 
that although in some situations they could identify potential sources of bias, 
they sometimes suggested spurious reasons for bias. They also experienced 
conflict in considering fairness, for example, in selecting a sample in relation 
to the desires of the people selected. Schwartz, Goldman, Vye, Barron, and 
The Cognition Technology Group at Vanderbilt (1998) observed similar 
results with school students of the same age. In a survey study of students' 
understanding of sampling, Watson and Moritz (2000a) found that 20% of 
grade 8 students could identify bias in at least one of two media contexts; by 
grade 11 this percentage rose to 66%. Based on in-depth interviews with 62 
students in grades 3, 6, and 9, they found that one grade 6 and eleven (of 17) 
grade 9 students were sensitive to bias (Watson & Moritz, 2000b). 

The basic definition of what constitutes a sample was considered by 
Watson and Moritz (2000a, 2000b) with the questions, "If you were given a 
'sample,' what would you have?", and "Have you heard of the word 
'sample' before? Where? What does it mean?" Four levels of response were 
observed reflecting the number of elements of relevance included in the 
description (from 0 to 3). The adequacy of this description was supported in 
later work of Watson and Kelly (2003) with a different group of students. 
Working with older students in an instructional setting, Saldanha and 
Thompson (2002) described different concepts of the sample-population 
relationship as being additive, where a sample is seen only in terms of the 
part-whole subset relationship, not as multiplicative, which also includes a 
"quasi-proportionality" relationship reflecting the features of the population. 
For students starting with more background in their study, these responses 
were parallel to the highest two levels observed by Watson and Moritz 
(2000a, 2000b). 


Research Questions 

As part of a larger study of school students' understanding of variability 
in relation to the chance and data curriculum and intervention to improve 
understanding, the following research questions were addressed in relation 
to the understanding of bias in sampling based on responses to survey 
questions. 


Cognition and Instruction: Reasoning about Bias in Sampling 




1. What are the initial understandings of students in grades 3, 5, 7, 
and 9? 

2. What change in understanding occurs after instruction in chance 
and data emphasising variation? 

3. What level of understanding is sustained after two years for 
students who experienced instruction provided by the project 
and for those who did not? How do these compare? 

4. How do students in longitudinal grades 5, 7, and 9 compare 
to the cohorts two years earlier? 

Methodology 

Sample 

The sample presented here consists of 639 students from grades 3, 5, 7, and 9 
in ten public schools in the Australian state of Tasmania who were surveyed 
as part of a larger study on school students' understanding of statistical 
variation including questions on basic chance, chance variation, data 
variation and sampling variation. Earlier analyses of these items have been 
reported in Watson and Kelly (2002a, 2002b, 2002c) for all grades. The sample 
sizes used in this study for each grade at each stage of the investigation are 
given in Table 1. The number of students in the current analysis is smaller 
than reported in the earlier analyses of Watson and Kelly, as the current 
analysis aims to deal with the questions related to sampling only. 


Table 1 

Number of Students in Each Grade 



                                  Grade
                                  3/5*    5/7*    7/9*    9/11*   Total
Sample (Pre)                      143     181     151     164     639
Sample (Post-Intervention)        57      80      76      72      285
Sample (Long.-Intervention)       36      53      51      23      163
Sample (Long.-Non Intervention)   47      35      53      30      165

* Grade in the longitudinal follow-up 
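
As a quick arithmetic check on Table 1, the sketch below sums the per-grade counts in each row and compares them with the printed totals. It also adds the two longitudinal groups, which should give the 328 students mentioned in the abstract. The dictionary layout and the function name `row_mismatches` are illustrative choices and not part of the study.

```python
# Counts transcribed from Table 1. Column order: grades 3, 5, 7, 9 at the pre-test.
table1 = {
    "Pre": ([143, 181, 151, 164], 639),
    "Post-Intervention": ([57, 80, 76, 72], 285),
    "Long.-Intervention": ([36, 53, 51, 23], 163),
    "Long.-Non Intervention": ([47, 35, 53, 30], 165),
}


def row_mismatches(table):
    """Return the names of rows whose grade counts do not sum to the printed total."""
    return [name for name, (counts, total) in table.items() if sum(counts) != total]


mismatches = row_mismatches(table1)

# All students available at the two-year follow-up, with or without instruction.
longitudinal_total = (
    table1["Long.-Intervention"][1] + table1["Long.-Non Intervention"][1]
)

print("Rows with inconsistent totals:", mismatches)
print("Students in the longitudinal follow-up:", longitudinal_total)
```

An empty mismatch list indicates that the row totals are internally consistent.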


Questions related to sampling were on the last half of the survey and some 
were not attempted by some students. To ensure a realistic data set on 
sampling, the authors deleted students who did not attempt at least two of 
the five "sets" of items in Figure 1 (Q1, Q2, Q3-Q8, Q9, or Q10-Q11) 
determined by physical placement on the survey. 

Although not separated for the initial analysis in the study, 341 students 
were in schools where teaching intervention took place as part of the study 
and 298 students were in schools with no intervention from the researchers. 
The reduction of the number of students from the pre-test to the post-test for 
students in the intervention schools was due to students having transferred 
from the school or being absent on the day the second survey was 
administered, approximately six weeks after the completion of the teaching 
intervention. The retention rate ranged from 76% to 91% over the four 
grades. For the longitudinal follow-up survey two years later, all grade 5 or 
9 students in the same schools from grades 3 or 7 were surveyed again. For 
grade 7, all students from grade 5 in the associated feeder primary schools 
were surveyed. No transfers to other high schools were traced. Between 
grade 9 and 11, all students either left school or transferred to a regional 
senior secondary college. Students were traced to four regional senior 
schools for the final survey and the number was reduced mainly through 
students leaving school or not wishing to continue as part of the study. 

The students surveyed were from ten schools considered to be typical of 
those in the state, each with a spread of academic ability. Five schools were 
in a relatively affluent area, three as feeder primary schools for two local high 
schools. The intervention high school had one intervention and one 
non-intervention feeder primary school. In the intervention high school, two 
grade 9 classes of "average" ability were assigned to the project, whereas the 
two grade 7 classes were of average to higher ability. In the non-intervention 
high school all students who were surveyed in grades 7 and 9 were of a range 
of ability levels. Five schools were in a less affluent area with three primary 
schools being feeder primary schools to two local high schools. The 
intervention high school had two intervention feeder primary schools. In the 
intervention high school at grade 9 three classes reflected different ability 
levels due to streaming of students, whereas the three grade 7 classes were 
of mixed ability. In the non-intervention high school there was a mix of 
ability levels in both grades 7 and 9. Five schools experienced instruction and 
five did not. 

Tasks 

The sampling questions shown in Figure 1 were part of a larger survey 
designed to assess school students' understanding of statistical variation in 
relation to the topics addressed in the chance and data curriculum. Questions 
Q1 to Q5, Q8, and Q9 were answered by students at all four grade levels, 
whereas Q6 was answered only by students in grades 5, 7, and 9; Q7, Q10, 
and Q11 were answered only by students in grades 7 and 9. Q6 and Q7 were 
omitted with younger students to shorten the time of administration of the 
survey, and Q10 and Q11 were only used with high school students because 
of the subject matter included. Q9 was the last part of a question related to 
reading information from a two-way table about participation of boys and 
girls in four sports at a school sports day. Q2 to Q8 were adapted from the 
work of Jacobs (1999), reflecting the standard accepted understanding of 
sampling appropriate for school-age students. Q10 and Q11 were used in an 
earlier study by Watson and Moritz (2000a). These questions and the rest of 
the items in the complete survey administered in the larger study are 
discussed and analysed in terms of student understanding of statistical 
variation by Watson, Kelly, Callingham, and Shaughnessy (2003). 


Q1. What does "sample" mean? 

Give an example of a "sample". 

Q2. A class wanted to raise money for their school trip to Movieworld on the Gold 
Coast. They could raise money by selling raffle tickets for a Playstation 2. 

But before they decided to have a raffle they wanted to estimate how many 
students in their whole school would buy a ticket. 

So they decided to do a survey to find out first. The school has 600 students 
in grades 1-6 with 100 students in each grade. 

How many students would you survey and how would you choose them? 
Why? 

Q3. Shannon got the names of all 600 children in the school and put them in a hat, 
and then pulled out 60 of them. 

What do you think of Shannon's survey? 

□ GOOD □ BAD □ NOT SURE 
Why? 

Q4. Jake asked 10 children at an after-school meeting of the computer games club. 
What do you think of Jake's survey? 

□ GOOD □ BAD □ NOT SURE 
Why? 

Q5. Adam asked all of the 100 children in Grade 1. 

What do you think of Adam's survey? 

□ GOOD □ BAD □ NOT SURE 
Why? 

Q6. Raffi surveyed 60 of his friends. 

What do you think of Raffi's survey? 

□ GOOD □ BAD □ NOT SURE 
Why? 


Q7. Claire set up a booth outside of the tuck shop. Anyone who wanted to stop 
and fill out a survey could. She stopped collecting surveys when she got 60 
kids to complete them. 

What do you think of Claire's survey? 

□ GOOD □ BAD □ NOT SURE 
Why? 

Q8. Who do you think has the best survey method? Why? 

Q9. A primary school had a sports day where every child could choose a sport to 
play. Here is what they chose. 



         Netball   Soccer   Tennis   Swimming   Total
BOYS     0         20       20       10         50
GIRLS    40        10       15       10         75


a) How many girls chose Tennis? 

b) What was the most popular sport for girls? 

c) What was the most popular sport for boys? 

d) How many children were at the sports day? 

e) The teacher wanted to choose four children to lead the closing parade. 
Suggest two fair ways she could have chosen them. 

The following article appeared in the Hobart Mercury. 

Decriminalize drug use: poll 

SOME 96 percent of callers to youth radio station Triple J have said marijuana 
use should be decriminalized in Australia. The phone-in listener poll, which 
closed yesterday, showed 9924 - out of the 10,000-plus callers - favoured 
decriminalisation, the station said. 

Only 389 believed possession of the drug should remain a criminal offence. 
Many callers stressed they did not smoke marijuana but still believed in 
decriminalizing its use, a Triple J statement said. 

Q10. What was the sample size in this article? 

Q11. Is the sample reported here a reliable way of finding out public support 
for the decriminalisation of marijuana? Why or why not? 


Figure 1. Questions on sampling used in the survey. 
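
The figures quoted in the newspaper article can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not a model answer from the study. It assumes that every caller was counted in exactly one of the two groups reported, which the article does not state explicitly. The variable names are hypothetical.

```python
# Figures quoted in the article reproduced in Figure 1.
callers_in_favour = 9924
callers_against = 389

# Assumption: each caller falls into exactly one of the two reported groups,
# so the number of respondents is the sum of the two counts.
sample_size = callers_in_favour + callers_against
percent_in_favour = 100 * callers_in_favour / sample_size

print("Callers counted:", sample_size)
print("Percent in favour:", round(percent_in_favour, 1))
```

Under that assumption the total is consistent with the article's "10,000-plus callers" and the percentage rounds to the 96 percent in the opening sentence. A large count of self-selected callers says nothing about how representative they are, which is the point Q11 asks students to consider.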


Procedure 

The survey was administered in class time by the authors along with the 
classroom teachers, all offering help when required to read items, 
particularly in grades 3 and 5. Approximately 45 minutes was allocated for 
completing the surveys. The same survey was given to the same grade at 
each testing, hence for the two-year longitudinal survey, ex-grade 3 students 
answered grade 5 questions and ex-grade 5 students answered grade 7 
questions and so on. In comparing for longitudinal retention, only questions 
answered in the initial survey were counted, whereas all questions used in 
the final year could be used for cross-cohort comparisons. 

Students in grades 3 and 5 in three of the six primary schools were 
taught a 10-lesson unit on chance and data emphasizing variation by a 
primary-trained mathematics specialist teacher provided by the project. The 
unit was taught over an 8-week period with two sessions at each school for 
each grade for each lesson. The content of the sessions is described in detail 
in Watson and Kelly (2002a) and summarized below. The students who 
received this intervention were administered the survey three times, initially 
(pre), six weeks after the instruction (post), and two years later 
(longitudinal). In the other three primary schools, there was no intervention 
from the project team. These students were only administered the survey 
twice, initially (pre) and two years later (longitudinal). 

Session 1 of the 10-lesson unit used in the primary schools was an 
investigation of the contents of small packets of Smarties™. Beginning with 
a discussion of the information provided on the outside of the packet, 
students then worked in pairs to "find out" about the contents, creating 
column graphs of the Smarties™ sorted by colour. The discussion centred on 
the numbers of Smarties™ in each packet, the different colours, and the 
number of each colour in the individual packets. Variation among packets (as 
samples of the manufacturing process) was a focus of the class discussion as 
were the combined class data. 

Session 2 aimed to develop ideas about defining the data to be collected, 
representing the data in different ways, and describing the general shape of 
the data. Data were collected, after suitable definitions were agreed to 
(differing from class to class), on the number of people in the children's 
families. Students themselves created people graphs and then used blocks 
before putting sticky dots on a class graph. There was discussion on the 
"most common" numbers in families, "outliers," and what could be said 
about half of the class, introducing the middle of the data. When describing 
the shape of the class graph, students tended to focus on individual features 
using terms like "chimney," but after encouragement to look more generally, 
began to use terms like "a mountain" or "a roller coaster" to describe 
variation observed. 

The following two sessions were about chance, the first dealing with 
equally probable events using a spinner and a single die, and the second 
dealing with non equally probable events arising from the summing of 
outcomes when two dice are tossed. Students carried out repeated trials, 
recording outcomes and combining them as a class, again describing shapes 
of data, e.g., "a box" for a single die and "a hill" for summing two dice 
outcomes. The idea of gaining confidence when more data are collected was 
discussed. Sampling was the specific focus of the next two sessions with 
contributions of examples of samples from class members and overall 
agreement on a basic definition. Selecting representative samples from the 
class population became a contentious issue when students agreed to 
random methods of selecting students, each of whom had an equal chance of 
being selected, but then were concerned about fairness if repeated sampling 
resulted in a student being selected a second time before everyone in the 
class had had a turn at being selected. Jacobs (1997, 1999) reported similar 
issues arising in her study. Sampling of cubes of two different colours from 
opaque bags, recording the results over many trials, and comparing the 
outcomes with expectation based on the bag's contents, were relatively 
sophisticated tasks for students in grades 3 and 5. 
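
The "box" and "hill" shapes that students described can be reproduced with a short sketch. It is not part of the lesson materials. It counts the equally likely outcomes for one die and for the sum of two dice, and then simulates repeated tosses in the way a class might pool its trials. The function names, the seed, and the number of trials are illustrative assumptions.

```python
import random
from collections import Counter
from itertools import product

FACES = range(1, 7)

# Exact counts of equally likely outcomes: one die, then the sum of two dice.
single_die = Counter(FACES)
two_dice_sum = Counter(a + b for a, b in product(FACES, repeat=2))


def text_histogram(counts):
    """Render counts as rows of '#' so the overall shape can be read off."""
    return "\n".join(f"{value:>2} " + "#" * counts[value] for value in sorted(counts))


def simulate_two_dice(n_trials, seed):
    """Toss two dice n_trials times and tally the sums (seeded for repeatability)."""
    rng = random.Random(seed)
    return Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_trials))


print(text_histogram(single_die))    # flat rows: the "box"
print(text_histogram(two_dice_sum))  # rows peak at 7: the "hill"

# Pooled trials approach the exact shape as more data are collected.
pooled = simulate_two_dice(n_trials=1000, seed=1)
print(text_histogram({value: pooled[value] // 10 for value in sorted(pooled)}))
```

Running the simulation with a small and then a large number of trials is one way to illustrate the idea of gaining confidence as more data are collected.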

Sessions 7 and 8 were based on students' measurements of how long they 
could stand on one foot with their eyes closed (Rubin & Mokros, 1990). Again 
decisions were made on data collection and representation in order to compare 
two groups of data, for example, left and right feet, or boys and girls. The final 
two sessions allowed students to make decisions and set up their own 
investigations to answer questions related to blowing a pencil across a smooth 
surface. In all of the sessions collecting or describing data was a feature and 
variation in samples was pointed out at all points where it occurred. Although 
not always explicitly stated, "sampling" was a fundamental idea used 
throughout the unit of work with grade 3 and 5 students. 

In grades 7 and 9, the regular mathematics teachers in two of the four 
secondary schools delivered the unit of work to their classes, five classes in 
total in each of grades 7 and 9. The students in the other two secondary 
schools received no intervention from the research team. In the schools 
where there was instruction, there were nine different teachers involved, 
with one teacher taking two grade 7 classes. Because of the lack of control of 
exactly what would be taught and when, a comprehensive package of six 
small units of work, possibly encompassing several lessons, was prepared 
for the high school teachers. The topics included variation involved in trials 
of spinners, in outcomes with dice, in repeated sampling, in measuring 
association, in comparing two groups, and in the numbers of chocolate chips 
in cookies (Bright, Harvey, & Wheeler, 1981; Lappan, Fey, Fitzgerald, Friel, & 
Phillips, 1998; Lovitt & Lowe, 1993; Torok, 2000; Watson, 2002a). The 
sampling activities were similar to those used with grades 3 and 5, and the 
association and comparing groups units were based on the measurement of 
hand span and foot length. Although sampling was the specific focus of one 
unit, various types of samples were employed in all other units to collect 
data for analysis. Discussion of variation in sampling was hence intended to 
be widespread across the units. 

The authors met with the grade 7 and 9 teachers in their schools, 
explaining the purpose of the project and distributing the 21 pages of lesson 
plans and 33 pages of associated documents (e.g., work sheets and copies of 
relevant pages from books). Suggestions as to the order in which the material 
might be taught were made, but final decisions were left to the judgments of 
the teachers. It was understood that the units were likely to be more 
comprehensive than some teachers would be able to fit into their programs, 
although, as yet, no teachers had taught chance and data that year. As it 
turned out, there was considerable variation in the number of lessons taught 
by the high school teachers. Table 2 shows the number of lessons taught and 
the content of the lessons taught for each class in grades 7 and 9, with all 
teachers including at least one lesson on dice, but only one teacher touching 
on the chocolate chip cookie problem. 


Table 2 

Number and Content of Lessons Taught for Each Class in Grades 7 and 9 


Grade/  Unit 1:   Unit 2:  Unit 3:   Unit 4:      Unit 5:    Unit 6:
Class   Spinners  Dice     Sampling  Association  Comparing  Cookies  Total
                                                  Groups
7A      4         2        3         0            0          0        9
7B      4         2        3         0            0          0        9
7C      0         3        0         0            0          0        3
7D      1         3        0         1            0          0        5
7E      3         3        2         2            4          0        14
9F      0         3        1         2            0          0        6
9G      4         3        1         2            0          2        12
9H      1         1        0         1            0          0        3
9I      0         5        0         0            0          0        5
9J      3         4        4         0            3          0        14
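
The entries of Table 2 can be checked against the statements made about them in the text. The sketch below is a consistency check only. It assumes that the last two class labels, which are difficult to read in the source, are 9I and 9J. The dictionary layout and the variable names are illustrative.

```python
# Lessons per unit transcribed from Table 2.
# Tuple order: spinners, dice, sampling, association, comparing groups, cookies.
# Assumption: the last two class labels are read as 9I and 9J.
lessons = {
    "7A": ((4, 2, 3, 0, 0, 0), 9),
    "7B": ((4, 2, 3, 0, 0, 0), 9),
    "7C": ((0, 3, 0, 0, 0, 0), 3),
    "7D": ((1, 3, 0, 1, 0, 0), 5),
    "7E": ((3, 3, 2, 2, 4, 0), 14),
    "9F": ((0, 3, 1, 2, 0, 0), 6),
    "9G": ((4, 3, 1, 2, 0, 2), 12),
    "9H": ((1, 1, 0, 1, 0, 0), 3),
    "9I": ((0, 5, 0, 0, 0, 0), 5),
    "9J": ((3, 4, 4, 0, 3, 0), 14),
}
DICE, SAMPLING, COOKIES = 1, 2, 5  # positions within each tuple

bad_totals = [c for c, (units, total) in lessons.items() if sum(units) != total]
all_taught_dice = all(units[DICE] >= 1 for units, _ in lessons.values())
cookie_classes = [c for c, (units, _) in lessons.items() if units[COOKIES] > 0]
no_sampling_unit = [c for c, (units, _) in lessons.items() if units[SAMPLING] == 0]

print("Rows with inconsistent totals:", bad_totals)
print("Every class had at least one dice lesson:", all_taught_dice)
print("Classes that used the cookie unit:", cookie_classes)
print("Classes with no lesson from the sampling unit:", no_sampling_unit)
```

The last line lists the classes that received no lessons from Unit 3, which may be worth keeping in mind when reading grade-level results on sampling.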


As in the primary schools, students in the two secondary schools who 
received the intervention were administered the survey three times: initially 
(pre), six weeks after the instruction (post), and two years later 
(longitudinal). In the other two secondary schools, where there was no 
intervention, the students were administered the survey twice only: initially 
(pre), and two years later (longitudinal). 

Analysis 

Responses to the questions in Figure 1 were coded hierarchically to reflect an 
increasing appreciation and understanding of the concept of sample and bias 
in sampling. For Q1, which asked for a definition and an example of the term 
"sample", a Code 3 was given to responses that recognized the part-whole 
relationship along with the purpose to "test". Code 3 responses integrated all 
three of these appropriate ideas with examples (e.g., "A tester, an example, 
usually random, a small portion of the real thing. At a supermarket you 
might try a juice sample, a tiny cup so you can get a taste or idea"). A Code 
2 response incorporated any two of the appropriate ideas in Code 3, be they 
the part-whole relationships (e.g., "A piece of something, water taken from 
the river is a water sample") or the purpose to test (e.g., "A little taste or try 
of something, a small sample of chocolate"). A Code 1 was given to 
definitions with single ideas of either quantity (e.g., "A bit") or the purpose 
of a sample (e.g., "To try something"), or to examples only (e.g., "Blood 
sample"). A Code 0 was given to confused ideas of sample (e.g., "The trial 
run, the no's 1-6 on a dice"), or idiosyncratic responses, or no response to 
the question. 

For Q2, which asked students to nominate how many people they would 
survey out of a school of 600 (100 in each of grades 1-6), and how they would 
choose them, five codes were developed from the four-code scheme used by 
Watson et al. (2003) to show an increasingly appropriate understanding of 
sampling methods. The highest code, a Code 4 was given to responses that 
combined an appropriate sample size with a random or representative and 
random method of selection (e.g., "I would randomly choose 10 students 
from every grade"); a Code 3 was given to the same kind of response, but 
with an unnecessarily large sample size (e.g., "I would choose 50 people out 
of each grade randomly picked"). Code 2 responses focused on 
representative methods of selection only, with either appropriate or 
inappropriate sample sizes (e.g., "I would ask 5 boys and 5 girls from each 
grade, which would make 60 students or 10%"), whereas Code 1 responses 
were non-representative and biased, regardless of the sample size given (e.g., 
"I would survey 10 students from each grade and pick them by whoever 
came first to volunteer"), or they focused on a method or on a sample size 
only, but not both in one response (e.g., "Ask people you see in the 
playground" or "100 of them"), or they wanted to survey the entire 
population (e.g., "I would survey the whole school, so that I would know 
exactly how many"). Code 0 was given to inappropriate responses, which 
misinterpreted the intent of the question, or to no response. 
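
To make the distinction between the selection methods concrete, the sketch below draws two samples of 60 from a model of the school described in Q2 (600 students, 100 in each of grades 1-6). One is a stratified random sample of 10 students per grade, like the Code 4 example response. The other is a simple random sample from the whole roll, like pulling names from a hat. This illustrates the sampling ideas and is not part of the coding procedure. The function names and the seed are hypothetical.

```python
import random
from collections import Counter


def build_school(grades=range(1, 7), per_grade=100):
    """Model the school roll as (grade, student_number) pairs."""
    return [(grade, number) for grade in grades for number in range(1, per_grade + 1)]


def stratified_sample(students, per_grade, rng):
    """Randomly choose the same number of students from every grade."""
    by_grade = {}
    for student in students:
        by_grade.setdefault(student[0], []).append(student)
    sample = []
    for grade in sorted(by_grade):
        sample.extend(rng.sample(by_grade[grade], per_grade))
    return sample


def simple_random_sample(students, size, rng):
    """Choose students at random from the whole roll, without replacement."""
    return rng.sample(students, size)


rng = random.Random(2024)  # seeded so the illustration is repeatable
school = build_school()
stratified = stratified_sample(school, per_grade=10, rng=rng)
names_from_hat = simple_random_sample(school, size=60, rng=rng)

print("Stratified, students per grade:", dict(Counter(g for g, _ in stratified)))
print("Simple random, students per grade:", dict(Counter(g for g, _ in names_from_hat)))
```

The stratified sample has exactly 10 students from each grade by construction, whereas the per-grade counts in the simple random sample vary from draw to draw.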

Q3 to Q7 were related to Q2; however, they provided different scenarios 
on how other hypothetical students decided to conduct their surveys of the 
school. The codes for each of the questions are presented in Table 3, adapted 
from Table 3 in Watson et al. (2003). Codes for Q3 to Q7 showed an increasing 
appreciation of sampling methods through critiquing others' methods. Code 
3 responses provided an appropriate statistical critique of the method. Code 
2 responses focused either on non-central issues regarding the method or on 
appropriate statistical issues with an element of uncertainty (noted by the 
"not sure" response). Code 1 responses focused on inappropriate issues, 
such as methods that create rather than remove potential bias, whereas Code 
0 responses were idiosyncratic. 




Table 3 

Coding Categories of Response for Questions 3 to 7 (adapted from Watson et 
al., 2003) 

Questions: Q3 Shannon's method [all grades]; Q4 Jake's method [all grades]; 
Q5 Adam's method [all grades]; Q6 Raffi's method [grades 5-9]; 
Q7 Claire's method [grades 7-9] 

Code 3 - Appropriate statistical response 
  Q3  Random methods: "Good, because it's a good random way to survey" 
  Q4  Detecting bias and small sample size: "Bad, not enough people and 
      selectively picked" 
  Q5  Detecting bias: "Bad, not enough different age groups" 
  Q6  Lack of range and/or variation: "Bad, they would probably say the same 
      thing" 
  Q7  Appropriate criticism: "Bad, some kids might go twice" 

Code 2 - Non-central ideas or uncertainty 
  Q3  Adequate sample size: "Good, there's a lot of people" 
  Q4  Uncertainty: "Not sure, because not many different people would go 
      there" 
  Q5  Large sample size: "Bad, too many people" 
      Uncertainty: "Not sure, because that's only one class but he surveyed 
      the most people" 
  Q6  Adequate sample size: "Good, you get a lot of answers" 
      Uncertainty: "Not sure, it depends how many of his friends have 
      different opinions" 
  Q7  Adequate sample size: "Good, you just have enough" 
      Uncertainty: "Not sure, because people who thought it was a bad idea 
      wouldn't bother" 

Code 1 - Inappropriate analysis 
  Q3  Method too random: "Bad, he could pick the wrong people" 
  Q4  Creating bias: "Good, to give them a hint to buy one" 
  Q5  Non-representative: "Good, because it is fair" 
  Q6  Creating bias: "Good, because they are his friends" 
  Q7  Creating bias: "Good, it is their own choice" 

Code 0 - Inappropriate logic 
  Q3  Misinterpret question: "Bad, too many people" 
  Q4  Misinterpret question: "Good, so you could play it" 
  Q5  Misinterpret question: "Bad, none might not buy any" 
  Q6  Misinterpret question: "Good, more money for them" 
  Q7  Misinterpret question: "Good, first in best served" 

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Question 8, again related to Q2 to Q7, asked students to choose which 
survey method they thought was the best one and why. Although students 
in different grades were presented with a different number of pofenfial 
survey mefhods (grade 3: Sharmon's, Jake's, Adam's; grade 5: Sharmon's, 
Jake's, Adam's, Raffi's; grade 7 and 9: Shannon's, Jake's, Adam's, Raffi's, 
Claire's), all grades were presenfed fhe sfafisfically appropriafe mefhod fo 
choose (Shannon's), and could fherefore receive fhe highesf code of 3 by 
choosing fhis mefhod as fhe besf one, combining if wifh a sfafisfically 
appropriafe reason (e.g., "Shannon, if was random and he doesn'f ask his 
friends"). A Code 2 response also chose Shannon, buf wifh an inappropriafe 
reason or for no reason, or chose Shannon wifh anofher inappropriafe choice 
(some responses said fhere were fwo besf mefhods). A Code 1 was given fo 
responses fhaf focused on any of fhe ofher four options (or combinafion of 
opfions if fwo were selecfed) and provided an inappropriafe sfafisfical 
reason (e.g., "Claire, because if would gef children from all ages and wifh 
differenf inferesfs"), a reason based on fairness (e.g., "Claire, if's fairer"), or 
a mefhodological reason (e.g. "Adam, because he asked fhe mosf people and 
could times his resulfs by six fo gef an average"). Code 0 was given fo 
responses fhaf were idiosyncratic or fo no response. 

Question 9 was a table-reading exercise about a sports day. Q9a) to Q9d) 
were basic table reading items and are not analysed here. Question 9e), 
however, focused on sampling and asked students to suggest two fair ways 
of picking children to lead a closing parade. For a code of 4, one out of the 
two responses focused on random and representative methods (e.g., "2 girls 
and 2 boys out of a hat"), or two responses were clearly distinct chance 
methods (e.g., "Pick out of a hat" and "Point to them without looking"); for 
a Code 3, at least one response had to be a simple chance method (e.g., "Put 
all the names in a hat and pull them out"). For Code 2 at least one response 
had to be a representative method by using two factors (gender/sport) (e.g., 
"Could have chosen 2 girls + 2 boys of which participated in everything"), 
and for a Code 1, at least one response needed to be representative using one 
factor (e.g., "Two boys and two girls"). Code 0 responses were again 
idiosyncratic or no response. 

Responses to Q10, which involved reading the sample size straight from the 
article, were coded right-wrong, with Code 1 being for responses "10 313," 
"10 000+," or "10 000." Question 11, on the other hand, required students to 
critique the sample method and claim of findings in the article. Four codes 
used by Watson and Moritz (2000b) were given to responses, with Code 3 
responses giving multiple appropriate criticisms of the sampling method 
used (e.g., "No, not every one listens to Triple J and only the people who 
want to ring up will") and Code 2 responses focusing on single specific 
biases such as the radio station (e.g., "No, only people [who listen to Triple 
J], because it's not random"), youth (e.g., "No, JJJ is a youth radio station, old 
people listen to Magic 107..."), and the response to phone polls (e.g., "No, 
because not all people will be bothered calling"). A code of 1 was given to a 
variety of responses. Some Code 1 responses critiqued the article for sample 
size issues, but without an appreciation of the part-whole relationships of 
sampling (e.g., "No, there are still heaps more people in Australia"), whereas 
others focused on the biases created by the type of callers phone polls attract 
(e.g., "No, because some could be users", "No, because some could lie"). 
Other Code 1 responses gave appraisal of the sampling process without 
recognition of the potential bias (e.g., "Yes, majority"). Code 0 responses 
were again idiosyncratic or misinterpretations, or were no response. 

Except for Q10, which was coded 0/1, all questions had a maximum 
score of 3 (8 questions) or 4 (2 questions). The hierarchical rubrics produced 
an ordering of scores for questions from least to most statistically 
appropriate, and when totalled the scores for grade 3 students could reach 
23, for grade 5, 26, and for grades 7 and 9, 33. The scoring produced totals 
distributed as in most classroom testing, roughly normal with a slightly 
higher representation of 0 scores for grades 7 and 9 and a slight indication 
of skewness to the right for grade 3. Total scores of 0 represented responses 
to at least two of the five "sets" of items indicated in the section describing 
the sample. 
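
The maximum totals quoted above can be checked from the per-question maxima. The following is a minimal sketch, assuming (from the question descriptions in this paper) that Q2 and Q9 are the two items with a top code of 4, that grade 3 answered Q1 to Q5, Q8, and Q9, that grade 5 additionally answered Q6, and that grades 7 and 9 answered all eleven questions. The names `MAX_CODE` and `max_total` are illustrative only and do not come from the study.

```python
# Highest code available for each question (Q10 was coded right/wrong, so 1).
# Assumption: Q2 and Q9 are the two questions with a top code of 4.
MAX_CODE = {"Q1": 3, "Q2": 4, "Q3": 3, "Q4": 3, "Q5": 3, "Q6": 3,
            "Q7": 3, "Q8": 3, "Q9": 4, "Q10": 1, "Q11": 3}

# Questions presented to each grade, as inferred from the text.
GRADE3 = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q8", "Q9"]
GRADE5 = GRADE3 + ["Q6"]
GRADE7_9 = GRADE5 + ["Q7", "Q10", "Q11"]


def max_total(questions):
    """Sum the highest available code over the questions a grade answered."""
    return sum(MAX_CODE[q] for q in questions)


print(max_total(GRADE3), max_total(GRADE5), max_total(GRADE7_9))  # 23 26 33
```

Under these assumptions the totals agree with the 23, 26, and 33 reported in the text.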

T-tests were used to compare differences in means on total scores for the 
common items for each pair of grades (e.g., grade 3 and 5; grade 5 and 7; 
grade 7 and 9) for all students initially surveyed. Paired t-tests were used for 
comparing pre-test and post-test total scores for each grade, and to compare 
the pre-test and post-test total scores with the longitudinal follow-up for 
students who experienced intervention lessons. For the non-intervention 
students paired t-tests were used for pre-test and longitudinal total scores 
only. To consider potential differences in improvement for the intervention 
and non-intervention students, difference scores were calculated for each 
student and the means of these compared for the two groups with t-tests. T- 
tests were also used to compare mean total scores for each grade in the 
original year of testing with mean scores for the equivalent grade two years 
later in the longitudinal follow-up. Figure 2 highlights this last comparison. 
These last comparisons were carried out separately for intervention and non- 
intervention schools. 
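
To make the paired comparison concrete, the sketch below computes a paired t statistic from per-student difference scores. It is an illustration only: the scores are hypothetical, not data from the study, and the sign convention (earlier score minus later score, so that improvement gives a negative t) is an assumption made to match the negative t values reported in the tables.

```python
import math
import statistics


def paired_t(before, after):
    """Return (t, df) for a paired t-test on the before - after differences."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    # Standard error of the mean difference uses the sample SD of the differences.
    se = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / se, n - 1


# Hypothetical total scores for five students (not data from the study).
pre = [5, 8, 6, 10, 7]
post = [7, 9, 9, 12, 8]
t, df = paired_t(pre, post)
print(round(t, 2), df)  # -4.81 4
```

The resulting t would then be compared with a t distribution on `df` degrees of freedom to obtain the p-value.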


[Figure 2: diagram pairing grades 3, 5, 7, and 9 in Year 1 with the equivalent grades in Year 3] 

Figure 2. T-test comparisons for students in the final year of testing with students 
in the equivalent grades two years earlier. 

Altogether 44 t-tests were performed and the conservative Bonferroni 
correction suggests a significance reduction from 0.05 to 0.0011. In the light 
of the information provided by the large number of p-values less than 0.05 
(28) compared to the expected number (2.2), all p-values less than 0.05 are 
reported for consideration, however. The effect sizes for these differences 
were determined using Cohen's (1969) methodology and are reported with 
descriptors devised by Cohen (1969) and Izard (2004). The effect sizes were 
calculated using Coe's Effect Size Calculator (Lane, 2003). 
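
The two figures quoted for the multiple-testing adjustment follow from simple arithmetic. As a brief sketch (the variable names are illustrative), the Bonferroni threshold is the nominal level divided by the number of tests, and the expected number of p-values below 0.05 when every null hypothesis holds is the nominal level multiplied by the number of tests:

```python
ALPHA = 0.05   # nominal significance level
N_TESTS = 44   # number of t-tests reported in the paper

# Bonferroni: per-test threshold that keeps the family-wise error rate at ALPHA.
bonferroni_alpha = ALPHA / N_TESTS

# Expected count of p < ALPHA by chance alone if all null hypotheses were true.
expected_by_chance = ALPHA * N_TESTS

print(round(bonferroni_alpha, 4), round(expected_by_chance, 1))  # 0.0011 2.2
```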

Results 

Descriptive Analysis of Items and Overall Performance 
by Grade in the Pre-test 

For Q1 to Q7, Table 4 shows the overall percentage correct for each question. 
Question 1 was answered by all grades and asked for a definition of the term 
sample. The majority of students were able to give a single idea associated 
with the term or gave an example only (Code 1) but could not go further 
(40.2%), or were unable to define the term at all (32.5%). More grade 7 
(12.6%) and 9 (11.6%) students reached the optimum level of response (Code 
3), and roughly an equal percentage of students in grades 5 and 7 responded 
at Code 2 (18.2% and 21.2%, respectively). Grade 3 was the only grade to 
have a majority of Code 0 responses (63.6%), and the percentage of Code 0 
responses declined over the grades to 29.3% in grade 5, 22.5% in grade 7, and 
18.3% in grade 9. The modal response code for all grades except grade 3 was 
Code 1. 

For Q2, more students responded at Code 1, citing non-representative 
methods and sample sizes, or they gave methods or sample sizes only, or 
wanted to survey the entire population. Very few students responded by 
giving random methods of selection (Codes 3 and 4) and no students in 
grade 3 gave appropriate sample sizes to match. The highest level of 
response for grade 3 was Code 3 (random methods without appropriate 
sample size). When answering this question, students seemed to find it 
difficult to formulate their own appropriate methods of sampling. 

The next five questions, Q3 to Q7, asked students to critique proposed 
methods of sampling, most of them flawed with biases, with the exception of 
Q3. For Q3, which was the random method of Shannon, over half the 
students either did not respond, or did not give a reason for their decision. 
Of those who answered above Code 0, Code 1 was the modal response; 
however, there were almost as many Code 2 responses. Code 1 responses, in 
this case, were inappropriate criticisms of this method, focusing on the 
perceived inaccuracy of the random method, unfairness of opportunity, and 
small sample size. Code 2 responses reflected non-central appraisals, more 
specifically in relation to fairness and the sample size. Only 5.8% of students 
overall appropriately appraised this method citing "random" and/or 
"range" in their reasons, with an increase from 0% in grade 3 to 14% in 
grade 9. The percent of Code 0 responses dropped as grade increased, with 
a small rise from grade 5 to 7, dropping again at grade 9. 


Table 4 

Percentages of Responses for Each Code for Q1 to Q8 (all students who answered 
the question in the pre-test) 

| Question | Code 0 | Code 1 | Code 2 | Code 3 | Code 4 |
|---|---|---|---|---|---|
| Q1 | 32.5 | 40.2 | 19.9 | 7.4 | NA |
| Q2 | 27.5 | 51.2 | 14.5 | 4.8 | 1.9 |
| Q3 | 54.6 | 20.5 | 19.1 | 5.8 | NA |
| Q4 | 46.6 | 14.4 | 11.0 | 28.0 | NA |
| Q5 | 39.3 | 20.5 | 23.5 | 16.7 | NA |
| Q6¹ | 39.9 | 18.5 | 33.3 | 8.3 | NA |
| Q7² | 44.1 | 43.2 | 7.9 | 4.8 | NA |
| Q8 | 31.3 | 37.4 | 13.1 | 18.2 | NA |

¹ Answered by Grades 5 to 9 only 
² Answered by Grades 7 and 9 only 

Students found Q4, Jake's method, easier to respond to than Q3; 
however, the modal response remained at Code 0. Students from each of 
grades 3 to 9 were able to detect the bias in the selection process, or 
mentioned the small sample size. The percentage of students responding at 
Code 3 rose monotonically from 7.7% in grade 3 to 45.7% in grade 9. There 
was a similar pattern of performance for Q5 on Adam's method; however, 
the modal response for those who answered above Code 0 was a Code 2. 
Code 2 responses focused on non-central criticisms, in this case the large 
sample size and fairness, or expressed some doubt in the criticism and were 
classified as statistically uncertain. The percentage of Code 3 responses per 
grade rose monotonically from 3.5% in grade 3 to 29.9% in grade 9. 

Although only grades 5 to 9 answered Q6 about Raffi's method, the 
modal code of response was still Code 0 for this question. For the remainder 
of students who answered above Code 0, most responded at Code 2, again 
focusing on non-central criticisms (e.g., fairness) and non-central appraisals 
(e.g., sample size). Again, there was a monotonic rise in performance from 
grade 5 (2.2%) to grade 9 (14.6%) in Code 3 responses. 

Only grades 7 and 9 answered Q7 based on Claire's method, with 
approximately 44% of students overall responding at Code 0. Broken down 
by grade, this accounted for 55% of grade 7 responses and 34.1% of grade 9 
responses. The modal response above Code 0 was only Code 1 and most 
students gave inappropriate reasons for the method, based on perceived 
benefits of range and variation, fairness, and freedom of choice. Only 4.8% 
of grade 7 and 9 students were able to see the potential biases in this method 
of sampling. 

Question 8 asked students to choose which they thought was the best 
method of sampling. Although all grades were presented with Q3, which 
was the statistically appropriate choice, the inclusion of Q6 and in particular 
Q7 for grades 7 and 9 had an impact on the results. Table 5 shows a decline 
by grade in the ability to choose the more appropriate method (Codes 2 and 
3). A breakdown by grade reveals that 40.6% of grade 3 students and 46.4% 
of grade 5 students recognized Q3 as the appropriate method, whereas only 
17.9% of grade 7 students and 21.9% of grade 9 students were able to do so. 
Although grade 5 students were not distracted by the inclusion of Raffi's 
method (Q6), the older students were distracted by Claire's method (Q7). 
Table 5 shows the percentages of students who chose each method in grades 
3, 5, 7, and 9. 


Table 5 

Percentages of Responses for the Best Method for Q8 for all Students 
in the Pre-test 

| Method chosen | Grade 3 | Grade 5 | Grade 7 | Grade 9 |
|---|---|---|---|---|
| Q3 (Shannon's method) | 39.9 | 45.9 | 15.9 | 18.9 |
| Q4 (Jake's method) | 22.4 | 14.8 | 4.5 | 2.3 |
| Q5 (Adam's method) | 21.0 | 18.2 | 8.6 | 9.1 |
| Q6 (Raffi's method) | NA | 10.5 | 2.6 | 1.2 |
| Q7 (Claire's method) | NA | NA | 45.0 | 50.6 |
| Q3 (Shannon) and other combination (2) | 0.7 | 0.5 | 2.0¹ | 3.0¹ |
| Q4, 5, 6, or 7 combination (2) | | 0.5 | | |
| All or none | 1.4 | 2.2 | 4.6 | 3.7 |
| Idiosyncratic, Don't know or no response | 14.7 | 6.6 | 17.9 | 7.9 |

¹ Response denotes Shannon and Claire 


Question 9, which asked for two fair ways to select students for a parade, 
provided a variety of responses. Overall, the modal response for this 
question was a Code 1 (36.9%), which required at least one method of 
selection that was representative based on one set of factors. The second 
most popular level of response, with 31.8%, was Code 3, which required at 
least one method of selection using simple chance methods, such as picking 
names out of a hat or rolling a die. By grade, the pattern of responses to Q9 
was inconsistent. A Code 4 response, which required at least one random and 
representative method of selection, or two distinctly different chance 
methods of selection, was achieved by only 3.5% of grade 3 students, 3.9% of 
grade 5, 4.6% of grade 7, and 0.6% of grade 9 students. Similarly, 12.6%, 
44.7%, 36.4%, and 29.9% of students in grades 3, 5, 7, and 9, respectively, 
achieved a Code 3 response. Code 0 responses made up 23.9% of the total 
responses, with grade 7 (32.4%) and grade 9 (36.6%) students contributing 
the most. This question was devised to tap into younger students' 
knowledge of sampling; however, perhaps this disadvantaged older 
students who saw the question as too trivial to deserve a complex answer. 

The results for Questions 10 and 11 were disappointing. Answered only 
by students in grades 7 and 9, 81.6% of students in Q10 were unable to identify 
the sample size from the article and were hence coded 0. Similarly, in Q11, no 
student in either grade managed to respond at Code 3, and only 11.7% 
responded at Code 2, and 15.2% at Code 1. The majority of responses (73%) 
were coded 0. Approximately 47% of students did not attempt both questions, 
as Q10 and Q11 were the last on the survey. This accounts in part for the large 
number of Code 0 responses, over half in Q10. Overall, grade 9 students 
performed slightly better than grade 7 students on both Q10 and Q11. 

Performance on all questions for each grade is given in Table 6. As can 
be seen, the mean total scores are quite low in comparison with the 
maximum total score possible. There is, however, an increase in performance 
based on the percentages of the maximum total score from grade 3 to grade 
5, with a drop from grade 5 to grade 7. There is another increase from grade 
7 to grade 9; however, this percentage is no greater than the percentage of the 
maximum total score for grade 5 students. 


Table 6 

Mean Total Scores and Standard Errors for Each Grade on the Pre-test 

| | G3 (n=143) | G5 (n=181) | G7 (n=151) | G9 (n=164) |
|---|---|---|---|---|
| Pre- Mean | 5.52 | 9.57 | 9.93 | 12.08 |
| Pre- Std Error | 0.285 | 0.336 | 0.514 | 0.468 |
| Maximum | 23 | 26 | 33 | 33 |
| Mean as % of Maximum | 24.0 | 36.8 | 30.1 | 36.6 |
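
The last row of Table 6 is the mean total score expressed as a percentage of the maximum available score for that grade. A short sketch of that arithmetic, using the values printed in the table (the dictionary names are illustrative):

```python
# Pre-test mean total score and maximum possible score per grade (Table 6).
means = {"G3": 5.52, "G5": 9.57, "G7": 9.93, "G9": 12.08}
maxima = {"G3": 23, "G5": 26, "G7": 33, "G9": 33}

# Mean as a percentage of the maximum, rounded to one decimal place.
pct_of_max = {g: round(100 * means[g] / maxima[g], 1) for g in means}

print(pct_of_max)  # {'G3': 24.0, 'G5': 36.8, 'G7': 30.1, 'G9': 36.6}
```

Expressing means this way is what allows grades that answered different numbers of questions to be compared on a common scale.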


Difference Between Grades 

Mean total scores were also used to compare grades. Because grades 5 and 7 
completed more questions than the next lower grade, totals were adjusted to 
include only those questions that were common for the lower grade. For 
example, for the grade 3/5 comparison, grade 5 responses for Q6 were not 
included in the total. Table 7 shows a significant difference between grades 3 
and 5 and between grades 7 and 9 on the common items for the lower grade 
of each comparison, the greatest difference being between grades 3 and 5. For 
grades 5 and 7 there was a very small decrease in performance over the 
common items. There was a small difference between grades 5 and 9 (-1.81, 
p<.04) on these items. 

Table 7 

Mean Total Scores, Standard Errors, T-tests, and Effect Sizes for Adjacent Grades 
on the Pre-test 

| Comparison | Grade | Pre- Mean | Std Error | t, p | Effect Size |
|---|---|---|---|---|---|
| Grades 3 and 5 | G3 (n=143) | 5.52 | 0.285 | -7.18, p<.0001 | 0.80 (Large) |
| | G5 (n=181) | 8.56 | 0.300 | | |
| Grades 5 and 7 | G5 (n=181) | 9.57 | 0.336 | 1.19, NS | -0.13 (Very Small) |
| | G7 (n=151) | 8.91 | 0.448 | | |
| Grades 7 and 9 | G7 (n=151) | 9.93 | 0.514 | -3.09, p<.002 | 0.35 (Small) |
| | G9 (n=164) | 12.08 | 0.468 | | |

Pre-Post Analysis for Intervention Students 

For the students who completed the post-test, paired t-tests were carried out 
for each grade. The mean total scores for the pre-test are reported again due 
to the reduced sample size. Table 8 shows that grades 3, 5, and 7 improved 
on the post-test after the teaching intervention to a small or medium extent; 
however, there was little improvement for grade 9 students. The similar 
means in grades 5 and 7 reflect the performance shown in Table 7 and the 
extra questions attempted in grade 7. 


Table 8 

Mean Total Scores, Standard Errors, Paired T-tests, and Effect Sizes for Each 
Grade on the Pre- and Post-test 

| | G3 (n=57) | G5 (n=80) | G7 (n=76) | G9 (n=72) |
|---|---|---|---|---|
| Pre- Mean | 6.60 | 9.48 | 9.49 | 12.06 |
| Pre- Std Error | 0.497 | 0.479 | 0.629 | 0.638 |
| Post- Mean | 8.00 | 12.13 | 13.01 | 12.61 |
| Post- Std Error | 0.670 | 0.570 | 0.808 | 0.795 |
| t, p | -2.38, p<.02 | -5.01, p<.0001 | -5.19, p<.0001 | -0.70, NS |
| Effect Size | 0.31 (Small) | 0.56 (Medium) | 0.56 (Medium) | 0.09 (Very Small) |


In comparison with the pre-test percentages for each question that were 
given descriptively in the previous section and in Table 3, it is interesting to 
note that for the post- sample the pattern of post-test percentages showed an 
increase in the optimum level of response for every question. This increase in 
optimum responses was complemented by a decrease in the percentages of 
Code 0 responses to all questions. The percentages of students who selected 
the statistically appropriate method (Shannon) as the best method in Q8 
revealed a mixed response in the post- sample, with a decrease for the grade 
3 and 9 students and an increase for the grade 5 and 7 students. It is of 
interest to note that this increase corresponds with the grades that showed 
the greatest improvement overall in the post-test after the instruction. There 
was, however, an increase in the percentages of students in grades 7 and 9 
who inappropriately chose Claire as the best survey method in the post- 
sample. 

Table 9 shows the pre- and post- means, standard errors, t-tests and 
numbers of lessons taught for the ten classes in grades 7 and 9. As can be 
seen, three of the grade 7 classes showed a significant increase in mean scores 
on sampling, whereas one class showed a slight improvement and another 
showed a significant decrease. It is interesting to note that the three classes 
that had a significant increase in mean score were also the classes who 
received lessons specifically in relation to sampling. The other two classes, 
although experiencing lessons that addressed variation through sample 
trials (e.g., dice, spinners), did not receive instruction specifically focused on 
sampling. 

As seen in Table 9, three of the grade 9 classes also showed a significant 
increase in the mean score on sampling, whereas one class showed a slight 
improvement and another class had a significant decrease in performance. 
For the classes that showed a significant increase in understanding, two had 
experienced lessons specifically related to sampling (one each), with the 
third class experiencing none. Evidence from the teachers' journals suggests 
that the class that experienced no lessons was of a higher ability level than 
the other classes in that school; further, the class that received four lessons 
specifically in relation to sampling and showed a minor increase in 
understanding sampling was of a lower ability level. The class with the 
significant decrease in performance also did not receive any lessons focusing 
specifically on sampling. 


Table 9 

Pre- and Post- Means, Standard Errors, Paired T-tests and Total Number 
of Lessons for Each Class in Grades 7 and 9 

| Grade / Class | Pre Mean | Std Error | Post Mean | Std Error | t, p | Lessons |
|---|---|---|---|---|---|---|
| 7A (n=17) | 10.47 | 1.000 | 16.12 | 1.477 | -4.17, p<.0004 | 9 |
| 7B (n=17) | 12.00 | 1.331 | 17.53 | 1.292 | -4.56, p<.0002 | 9 |
| 7C (n=9) | 6.56 | 1.324 | 8.22 | 1.690 | -1.69, NS | 3 |
| 7D (n=9) | 9.11 | 1.695 | 5.00 | 1.323 | 2.75, p<.02 | 5 |
| 7E (n=24) | 8.25 | 1.296 | 12.42 | 1.371 | -3.48, p<.002 | 14 |
| 9F (n=11) | 13.27 | 1.251 | 16.45 | 1.592 | -2.11, p<.04 | 6 |
| 9G (n=9) | 7.67 | 1.826 | 12.78 | 1.211 | -4.23, p<.002 | 12 |
| 9H (n=25) | 13.80 | 1.112 | 16.36 | 1.139 | -2.43, p<.02 | 3 |
| 9I (n=14) | 12.79 | 1.314 | 4.86 | 1.440 | 5.17, p<.0001 | 5 |
| 9J (n=13) | 9.92 | 1.337 | 10.38 | 1.328 | -0.29, NS | 14 |


Longitudinal Change 

Table 10 shows the pre-test, post-test and longitudinal means, standard 
errors, t-test values and effect sizes for students who participated in the 
longitudinal follow-up in the schools with intervention. Again, the pre- and 
post- mean total scores are reported to reflect the reduced sample size. Even 
though each grade received the survey administered two years earlier to the 
equivalent grade (e.g., grade 3 students in 2000 now in grade 5 received the 
same survey as the grade 5 students in 2000), the mean total scores reflect 
what was achieved using only the items presented two years earlier. 

Table 10 

Mean Total Scores, Standard Errors, T-tests, and Effect Sizes for Each Grade on 
the Pre- and Post-test and the Longitudinal Follow-Up for the Students who 
Experienced Intervention 

| | | G3/5¹ (n=36) | G5/7¹ (n=53) | G7/9¹ (n=51) | G9/11¹ (n=23) |
|---|---|---|---|---|---|
| Pre- | Mean | 7.19 | 9.58 | 9.04 | 13.91 |
| | Std Error | 0.656 | 0.579 | 0.813 | 1.178 |
| Post- | Mean | 7.75 | 11.40 | 11.71 | 12.00 |
| | Std Error | 0.885 | 0.670 | 1.008 | 1.602 |
| | t, p | -0.71, NS | -3.05, p<.002 | -2.72, p<.005 | 1.17, NS |
| | Effect Size | 0.12 (Very Small) | 0.40 (Small) | 0.41 (Small) | -0.28 (Small) |
| Long. | Mean | 11.31 | 11.55 | 16.10 | 15.00 |
| | Std Error | 0.739 | 0.780 | 0.957 | 1.632 |
| (pre-) | t, p² | -5.20, p<.0001 | -2.73, p<.005 | -6.75, p<.0001 | -0.977, NS |
| | Effect Size | 0.98 (Large) | 0.39 (Small) | 1.11 (Large) | 0.16 (Small) |
| (post-) | t, p³ | -4.86, p<.0001 | -0.20, NS | -5.09, p<.0001 | -2.03, p<.03 |
| | Effect Size | 0.73 (Medium) | 0.03 (Very Small) | 0.63 (Medium) | 0.39 (Small) |

¹ Grade in the longitudinal follow-up 
² Pre-test to longitudinal follow-up 
³ Post-test to longitudinal follow-up 


For each grade the effect size of change in the post-test decreased from 
that observed for the larger sample sizes reported in Table 8, with grade 9 
showing a small decrease in performance. For this smaller group, after 
instruction there was an improvement (Pre- to Post-) for the grade 7 students, 
with a sustained (Pre- to Long.) and continued improvement over the two- 
year period (Post- to Long.). The grade 5 students also showed an 
improvement after the instruction and a sustained improvement long term 
over the two years (Pre- to Long.) but did not continue to improve after the 
instruction like the grade 7 students (Post- to Long.). The specialized 
instruction had little effect on the grade 3 students (Pre- to Post-) but after 
two years the grade 3 students showed a large improvement. The grade 9 
students who showed a small decrease after instruction (Pre- to Post-), 
reversed this to a small improvement after two years (Post- to Long.). 

Examples to illustrate the increase in understanding after instruction 
(Pre- to Post-) with a sustained improvement over the two-year period 
(Long.) are given in Table 11 for the same individual students for each 
question. Although only 25% of students displayed this pattern of improved 
performance, it indicates what is potentially achievable. 


Table 11 

Examples of Improvement over the Three Survey Conditions for Selected Students 

| Question | Grade | Pre-test response | Post-test response | Longitudinal response |
|---|---|---|---|---|
| Q1 - Definition | 7 | "To test something out, some food or wine or something like that" (Code 1) | "A small amount of something. Something that has been tested" (Code 3) | "Sample means to take a bit of something and test it. Like trying a bit of bun at the bakery" (Code 3) |
| Q2 - Movieworld | 7 | "Make them all do the survey . . . 600 students . . . because they need money to buy tickets so the more people who know about it the better" (Code 1) | "Choose them randomly, 60 students in each grade, because [it] would give you enough people to get a good answer" (Code 3) | "I'd pick 20 people from each grade randomly, that should give a clear enough answer, [because] it makes sense" (Code 3) |
| Q3 - Shannon's method | 9 | "Good, because she knows how many would buy one" (Code 0) | "Good, because she picked people randomly" (Code 3) | "Good, because they were picked at random" (Code 3) |
| Q4 - Jake's method | 5 | "Good, because it will be good" (Code 0) | "Bad, he only had 10 people" (Code 3) | "Bad, only 10 people" (Code 3) |
| Q5 - Adam's method | 5 | "Good, he asked every single one" (Code 1) | "Bad, it is only getting one classes opinion" (Code 3) | "Bad, because he didn't get answers from all grades" (Code 3) |
| Q6 - Raffi's method | 7 | "Good, because if you x10 you get 600 and they would be the same age" (Code 1) | "Bad, because they would most likely [agree], they are his age" (Code 3) | "Bad, because they all might feel the same way" (Code 3) |
| Q7 - Claire's method | 7 | "Good, she's smart" (Code 0) | "Bad, because only people interested would do it" (Code 3) | "Bad, only people who would say yes would do it" (Code 3) |
| Q8 - Best method? | 7 | "Claire, because it is just a good idea" (Code 0) | "Shannon, because it's completely random" (Code 3) | "Shannon, because it was more random. She had a chance to get the whole school's opinion" (Code 3) |
| Q9 - Sports day parade | 3 | "2 girls, 2 boys"; "Vote" (Code 1) | "Name out of a hat"; "2 girls out of a hat, 2 boys out of a hat" (Code 4) | "Pulled name out of a hat"; "Think of a number and the 4 people who guess closest go" (Code 4) |
| Q11 - Media | 9 | "Yes, because people could ring up and have a say" (Code 1) | "No, because it's not everyone, it's only the ones that listen to JJJ" (Code 2) | "No, generally only young people listen to JJJ so it isn't a fair sample group over the whole Australia" (Code 2) |

Table 12 contains similar pre-test and longitudinal survey results to 
Table 10, for the non-intervention schools. Each grade showed some 
improvement in performance over the two-year period. The greatest 
improvement over two years was for grade 3. There was also a significant, 
yet smaller degree of improvement for students originally in grade 7 and 
grade 9. Although these students did not experience any intervention from 
the research project team, it is reasonable to expect some improvement over 
time due to the general school experience and maturation. For grades 3, 5, 
and 7, the effect size of the improvement from the pre-test to the 
longitudinal follow-up was not quite as great for the students in the non- 
intervention schools as it was for the students in the schools where the 
intervention took place. 

Table 12 

Mean Total Scores, Standard Errors, Paired T-tests, and Effect Sizes for Each 
Grade on the Pre-test and Longitudinal Follow-up for the Students who did not 
Experience Intervention 

| | | G3/5¹ (n=47) | G5/7¹ (n=35) | G7/9¹ (n=53) | G9/11¹ (n=30) |
|---|---|---|---|---|---|
| Pre- | Mean | 5.30 | 9.71 | 11.49 | 15.60 |
| | Std Error | 0.463 | 0.744 | 0.895 | 1.009 |
| Long. | Mean | 7.83 | 10.48 | 14.53 | 18.20 |
| | Std Error | 0.523 | 0.973 | 0.993 | 1.271 |
| | t, p | -4.22, p<.0001 | -0.93, NS | -3.71, p<.0003 | -1.97, p<.03 |
| | Effect Size | 0.75 (Large) | 0.15 (Small) | 0.44 (Small) | 0.41 (Small) |

¹ Grade in the longitudinal follow-up 


Comparison of Longitudinal Change for Intervention and Non- 
Intervention Schools 

Table 13 contains the means, standard errors, two-tailed t-test results and 
effect sizes in comparing the difference scores (longitudinal - pre-test) for the 
intervention and non-intervention students at each grade level. 

Table 13 shows that in grades 3 and 5 the intervention students had a 
higher mean difference score than the non-intervention students, but not 
significantly so. Although there is some indication that there was a greater 
positive difference for grade 7 students in schools with classroom 
intervention, the differences for other grades were negligible. 

Table 13 

Mean Total Scores, Standard Errors, Two-tailed T-tests, and Effect Sizes for Each 
Grade on the Difference Scores for the Students in the Intervention and Non- 
Intervention Schools 

| | Intervention (Mean, Std Error) | Non-Intervention (Mean, Std Error) | t, p | Effect Size |
|---|---|---|---|---|
| Grade 3/5¹ | 4.11, 0.790 | 2.53, 0.600 | 1.62, NS | -0.36 (Small) |
| Grade 5/7¹ | 1.96, 0.719 | 0.77, 0.826 | 1.07, NS | -0.23 (Small) |
| Grade 7/9¹ | 7.06, 1.045 | 3.04, 0.817 | 3.04, p<.002 | -0.60 (Medium) |
| Grade 9/11¹ | 1.09, 1.112 | 2.60, 1.320 | -0.84, NS | 0.23 (Small) |

¹ Grade in the longitudinal follow-up 

Change Within Schools Over a Two-Year Period 

Detecting change within schools over the two-year longitudinal period was 
possible by comparing scores on common items for students originally in 
grade 5 in the first year of testing (pre-test in 2000), with students in grade 5 
(originally in grade 3) in the third year of testing (longitudinal follow-up in 
2002) in both the schools that experienced intervention and the schools that 
did not. A similar comparison was carried out for students originally in 
grade 7 and grade 9 (see Figure 2 for clarification). Common questions were 
used for the comparisons. 

For the schools that experienced intervention, Table 14 shows an 
improvement in performance for students in grade 5 in the longitudinal 
follow-up who had received instruction when they were in grade 3 two years 
earlier, when compared to the students originally in grade 5 before the 
intervention began. Similarly, students who were in grade 7 in the 
longitudinal follow-up who were originally in grade 5 and received 
instruction two years earlier, performed better than the original grade 7 
students did. There was a non-significant improvement in favour of the 
longitudinal grade 9 students compared to those in grade 9 originally. 


Table 14 

Mean Total Scores, Standard Errors, T-tests, and Effect Sizes for the Same Grade 
Two Years Apart in the Intervention Schools 

           Pre-test (2000)        Longitudinal (2002) 
           (Mean, Std Error)      (Mean, Std Error)      t, p              Effect Size 
Grade 5    9.58, 0.579 (n=53)     12.75, 0.819 (n=36)    -3.25, p<.0001    0.70 (Medium) 
Grade 7    9.04, 0.813 (n=51)     12.17, 0.827 (n=53)    -2.70, p<.005     0.53 (Medium) 
Grade 9    13.91, 1.178 (n=23)    16.10, 0.957 (n=51)    -1.34, NS         0.34 (Small) 


Table 15 shows that for the schools where the students did not 
experience intervention, there was no difference in performance between the 
students in grades 5, 7, and 9 in the longitudinal follow-up compared to the 
equivalent grades two years earlier. In fact in each case, there was a minimal 
drop in performance. 

Table 15 

Mean Total Scores, Standard Errors, T-tests, and Effect Sizes for the Same Grade 
Two Years Apart in the Non-Intervention Schools 

           Pre-test (2000)        Longitudinal (2002) 
           (Mean, Std Error)      (Mean, Std Error)      t, p         Effect Size 
Grade 5    9.71, 0.744 (n=35)     8.76, 0.599 (n=47)     1.00, NS     -0.22 (Small) 
Grade 7    11.49, 0.895 (n=53)    11.26, 1.088 (n=35)    0.16, NS     -0.04 (Very Small) 
Grade 9    15.60, 1.009 (n=30)    14.53, 0.993 (n=53)    0.70, NS     -0.16 (Very Small) 


Discussion 

The educational messages from this study are mixed. On one hand it is 
encouraging to observe significant change with a medium effect size in some 
instances after instruction, along with increases in the percentages of the 
highest level responses and decreases in the percentages of the lowest level 
responses to the items in the survey. On the other hand, the average 
performances across grades would not be considered satisfactory in terms of 
classroom learning objectives, as observed in the coding levels described for 
the items used in the surveys. Also, for students in grades 3 and 9, the effect 
size after the teaching intervention was small or very small, respectively. The 
outcomes are considered in more detail in relation to the research questions, 
the limitations of the study, and the educational implications. 

Research Questions 

The initial understanding of sampling showed a dip in performance by 
grade 7 students. This was evident both in the relative mean scores as a result 
of the maximum possible score on the questions asked (see Table 6) and in a 
comparison of grades on common items answered (see Table 7). In the latter 
case there was a small effect favouring grade 5 over grade 7, and the positive 
difference favouring grade 9 reflects to some extent the drop at the grade 7 
level. There was only a small difference between grade 5 and grade 9, with a 
small effect favouring the grade 9 students. This dip in grade 7 performance 
was also observed in the larger study (Watson & Kelly, 2004) and has been 
seen in other studies of middle school students (Callingham & McIntosh, 
2002; Hill, Rowe, Holmes-Smith, & Russell, 1996). As evidence continues to 
accumulate from studies across mathematics topics and other areas of the 
curriculum, the issue of the middle school drop in performance will require 
considerable attention. 

The change in understanding observed after instruction was positive for 
each grade, although, as noted, the effect size was small for grade 3, and very 
small for grade 9. Students in grade 3 and grade 5 experienced the same 
lessons presented by the same teacher. Observation of the videotape of 
selected lessons indicated that specific discussion of sampling was a major 
feature at both grade levels and it appeared that students were engaged in 
the tasks presented to them. It must be surmised that, in the short term, 
grade 3 students were unable to incorporate as much of the appreciation of 
sampling as the grade 5 students. As noted, there was less control by the 
researchers of the teaching that took place in grades 7 and 9. Anecdotal 
evidence (Watson & Kelly, 2002c) suggests that in each grade there was one 
classroom where the teacher experienced difficulty with the task, and that 
the grade 7 teacher who taught two classes was an enthusiastic participant in 
the project. It may also be relevant to recall that the grade 7 students started 
with a lower mean score than grade 5, and hence had quite a potential for 
improvement. 

Over the two-year period of the project, each of the four grade levels in 
the intervention schools displayed a different pattern of improvement. The 
grade 5 students, who improved in the short term, were the only group not 
to show at least a small further positive effect after two years. This result is 
consistent with the middle school dip in performance observed in the initial 
data for grades 5 and 7. For grade 3 and grade 7 students the improvement 
over two years was impressive but little can be attributed to the short-term 
effect of the instruction for grade 3. For grade 7, the effect for the larger group 
of students who completed the post-test was more impressive than for the 
smaller group still in the study after two years, suggesting positive change in 
both the short and long term. For grade 9 students, small effects were seen in 
both the short and long term. Again, this improvement cannot be attributed 
with confidence to the short-term effect of the instruction experienced in 
grades 7 and 9 respectively. Furthermore, for both grade 7 and grade 9, there 
was a large degree of fluctuation for individual classes. The potential shown 
for improvement by the examples of individual students' responses to 
particular items is encouraging. The challenge is to help a larger group of 
students achieve such sustained improvement. 

For the students in the schools that did not experience any intervention 
from the research team, the observed improvement over time was not 
surprising. "Chance and Data" has been a part of the Curriculum in 
Tasmania for a decade and it is assumed that the content is being taught in 
classrooms. It is known that several teachers in the non-intervention high 
schools attended Quality Teacher Programs, including sessions on chance 
and data led by the first author, at some stage after the initial testing; 
however, monitoring attendance at professional development seminars was 
not part of approved ethics procedures. The improvement for non- 
intervention schools highlights the need for caution when interpreting the 
long-term results of the students in the intervention schools, suggesting that 
the increased level of improvement in the longitudinal follow-up may have 
been due to other factors rather than to the specific instruction implemented 
by the research team two years earlier. 

Comparing grades 5, 7, and 9, at the end of the study, with their 
equivalent cohorts two years earlier in both the intervention schools and 
non-intervention schools, suggested an encouraging result in that there was 
an indication that the teaching interventions for grades 3, 5, and 7, 
respectively, may have produced a better overall appreciation of sampling 
than was present at the same grade levels when the project began in the 
intervention schools. In contrast to this, there was no improvement in 
student understanding of sampling in the non-intervention schools in grades 
5, 7, and 9 in the third year of testing, compared to the original students in 
grades 5, 7, and 9 in these schools. 

Limitations 

Several limitations of the project and its design should be acknowledged. 
This study itself was not based on a random sample. Such sampling is 
usually impossible in educational settings and certainly when a teaching 
intervention is involved. The schools chosen, however, were representative 
of the state government education system in Tasmania, and likely other state 
systems in Australia. 

The control over the teaching sequence in grades 7 and 9 was much more 
limited than in grades 3 and 5. Providing a high school mathematics teacher 
for all classes within a complex high school timetable was beyond the 
financial resources of the project. It was also expected by the researchers that 
although primary school teachers might be intimidated by elements of the 
chance and data curriculum, high school mathematics teachers should not 
be. This may have been a misapprehension, particularly in terms of teachers' 
motivation to teach and enthusiasm about the topics. As reported in Watson 
and Kelly (2002c) for the ten classes in grades 7 and 9, the overall correlation 
of number of lessons taught and the "post- - pre-" mean score on the larger 
survey of which the sampling subscale used in this study was a part, 
explained only 18% of the variance and was not statistically significant. 
Hence the number of lessons taught cannot be hypothesized as a predictor of 
motivation on the part of teachers to enthuse the students or of students' 
greater achievement, either overall or in relation to sampling. Helme and 
Stacey (2000) encountered a similar situation when they provided resources 
for teaching decimals to four willing primary teachers, with only one 
consistently using them. In their study, student outcomes were strongly 
related to teacher use of materials, a result not as evident in the current study 
that involved high school teachers. As noted earlier, in one high school, the 
grade 9 classes selected were said by the organising person to have students 
of "average" ability, rather than a wide range of students including higher 
ability. Although this was catered for in terms of pre-test and post-test 
measurements, again it may have influenced the interest and motivation of 
the students. 
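
The statement that a correlation explaining 18% of the variance across ten classes is not statistically significant can be checked with the usual t-test for a Pearson correlation. The sketch below is illustrative only. It assumes a two-tailed test at the .05 level with n - 2 = 8 degrees of freedom (critical value 2.306 from standard t tables), and the function name is hypothetical. The sign of r cannot be recovered from the variance explained, so the positive root is taken.

```python
import math

def correlation_t(r_squared, n):
    """t statistic for testing a Pearson correlation against zero, from r^2 and n."""
    r = math.sqrt(r_squared)  # positive root; the sign of r is an assumption
    return r * math.sqrt(n - 2) / math.sqrt(1 - r_squared)

T_CRIT_DF8 = 2.306  # two-tailed .05 critical value for 8 degrees of freedom

t = correlation_t(0.18, 10)
print(f"t = {t:.2f}, significant at .05: {t > T_CRIT_DF8}")
```

With only ten classes, the resulting t of roughly 1.3 falls well short of the critical value, which is consistent with the non-significant result reported above.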

Although sampling was the focus of some lessons, and discussion about 
samples took place in all grade 3 and 5 classes and it is assumed in most 
grade 7 and 9 classrooms, except for the definition of sample itself, there was 
no specific reference during teaching to the items on the survey. In particular, 
the Movieworld questions (Q2 to Q8 in Figure 1) were intended to be of a 
sufficiently general and familiar nature that they would measure the transfer 
of understanding from activities carried out in the classroom. It may be that 
the discussion of bias in sampling in the classroom was not sufficiently 
similar to the context of the items in the survey to encourage transfer. 
Pertinent to this question is the item about Claire's method (Q7 in Figure 1), 
which was only given to students in grades 7 and 9. As seen in Table 5, the 
presence of this item was a major distracter for these grades in determining 
which sampling method was the most appropriate. 

As noted in the description of the coding system leading to the scores 
used for measuring student performance, the rubrics represented the 
authors' views of statistical appropriateness for students at the school level. 
The hierarchical nature of the scoring is inherent in this appropriateness and 
in the increasingly complex structure observed in the responses (Pegg, 2002). 
Others may have a different perspective on coding. 

From a measurement perspective the presence of Claire's method (Q7) 
disadvantaged the students in grades 7 and 9 compared to grades 3 and 5, 
and may contribute marginally to their smaller increase in performance 
levels with respect to the earlier grades. The presence of the question from an 
educational standpoint, however, provides valuable information about 
students' beliefs concerning sampling. Ideas of fairness in a colloquial sense, 
and allowing for voluntary participation, are more important to students 
than avoiding bias by using a random method. Teachers need to be aware of 
these beliefs and make specific provision for discussing them in the 
classroom. 
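
One way to make the cost of voluntary participation concrete for classroom discussion is a small simulation comparing a voluntary-response sample with a simple random sample. The sketch below is not drawn from the study. The population size, the share of students holding an opinion, and the two volunteering probabilities are all hypothetical values chosen for illustration, as is the function name `simulate_poll`.

```python
import random

def simulate_poll(pop_size=10000, true_share=0.30, p_volunteer_yes=0.6,
                  p_volunteer_no=0.2, sample_size=500, seed=1):
    """Compare a voluntary-response estimate with a simple random sample estimate."""
    rng = random.Random(seed)
    # Each student holds the opinion (True) with probability true_share.
    population = [rng.random() < true_share for _ in range(pop_size)]
    actual = sum(population) / pop_size

    # Voluntary response: students who hold the opinion are more likely to take part.
    volunteers = [x for x in population
                  if rng.random() < (p_volunteer_yes if x else p_volunteer_no)]
    voluntary_estimate = sum(volunteers) / len(volunteers)

    # Simple random sample: every student has the same chance of selection.
    random_sample = rng.sample(population, sample_size)
    random_estimate = sum(random_sample) / sample_size
    return actual, voluntary_estimate, random_estimate

actual, voluntary, randomised = simulate_poll()
print(f"actual {actual:.2f}, voluntary {voluntary:.2f}, random {randomised:.2f}")
```

Under these assumed probabilities the voluntary sample overstates the share holding the opinion, while the random sample stays near the population value, even though the voluntary sample is larger.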

Educational Implications 

Although this intervention study sought to compensate for the researchers' 
perception that previous teaching in relation to the chance and data 
curriculum had neglected specific descriptive discussion of sampling, it is 
clear that even more needs to be done along these lines. Even though 
teachers may emphasize the importance of understanding samples and the 
purpose of avoiding bias, students may not appreciate the importance and 
lose concentration because numbers and calculations are not being presented 
to them, as is the perception of a normal mathematics classroom. The 
discussion of Jacobs (1999) is helpful in this regard for thinking of students 
in the upper primary years. As well as stressing the need to confront students 
by challenging the "fairness" and "self-selection" rationales, she suggests 
two further considerations for designing instructional activities. First, she 
suggests that teachers need to give students practice at making decisions 
from the results of multiple surveys, as students tend to aggregate 
information from all surveys when drawing conclusions, even after 
identifying biases with certain methods. Second, teachers should supply 
students with surveys based on multiple situations in a variety of contexts, 
including within the school and in the outside world. Surveys conducted 
outside the school context often result in students seeing more clearly the 
reason to use samples due to the larger population and the inability to 
survey everyone. 

Carrying out sampling activities in the classroom as suggested by 
Watson and Shaughnessy (2004) in the context of drawing handfuls of lollies 
from a container with a given percentage of a certain colour, can also be 
useful. Students' discussion of their own methods of drawing handfuls is 
likely to bring out accusations of cheating or bias on the part of other 
students. Activities such as this one link to the chance part of the curriculum 
in terms of predicting outcomes based on the proportion of each colour 
present in the container. Repeated sampling (with replacement) from 
mystery bags containing a small number of coloured objects, with the aim of 
guessing the number of objects of each colour, is another activity (used in 
some classes in this study) that can reinforce appropriate ideas of sampling 
technique and sample size. Watson (2002b) describes the bias that occurs 
when data from two samples of size two are combined as if they were one 
sample of size four. Allowing students to experience such difficulties and 
discover the consequences may be instrumental in building appropriate 
understandings of the sampling process. It is also possible to introduce 
students to the interesting history of the development of sampling 
methodology within the field of statistics (e.g., Bernstein, 1998; Salsburg, 
2001). To hear of the difficulties and debates experienced over the past two 
centuries may help students to appreciate their own dilemmas in 
considering bias in sampling. It will also help them prepare for more 
advanced work where subtle issues of sampling are considered in more 
detail than is possible in the middle and high school years. 
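
The mystery bag activity described above can be mimicked in a few lines of code, which may help teachers preview how the number of draws affects the quality of the guess. This is a minimal sketch rather than the procedure used in the study's classrooms. The bag contents (3 red and 7 blue objects), the draw counts, and the function name `estimate_bag` are hypothetical, and the bag size is assumed to be known to the guesser.

```python
import random

def estimate_bag(bag, draws, rng):
    """Estimate the count of each colour from repeated draws with replacement."""
    counts = {}
    for _ in range(draws):
        colour = rng.choice(bag)  # the object is returned to the bag after each draw
        counts[colour] = counts.get(colour, 0) + 1
    # Scale observed proportions to the known bag size and round to whole objects.
    return {c: round(len(bag) * k / draws) for c, k in counts.items()}

rng = random.Random(0)
bag = ["red"] * 3 + ["blue"] * 7   # hypothetical bag of 10 objects
for draws in (5, 20, 1000):
    print(draws, estimate_bag(bag, draws, rng))
```

Rerunning the loop with different seeds shows the estimates from 5 draws varying widely while those from 1000 draws settle on the true contents, which is the sample-size idea the activity is meant to reinforce.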

The poor performance of students in grades 7 and 9 on questions related 
to an article about a survey from the media was due, in part, to students not 
being able to finish the survey in the timeframe that was allowed; however, 
since 25% of students responded in an idiosyncratic manner, it may also be a 
reflection on student unwillingness to "read" questions on what is perceived 
to be a mathematics test. Further, it may be related to low literacy levels or to 
a lack of experience with critical reading of the newspaper. As noted 
elsewhere (Gal, 2002; Watson, 1997, 2000) the ability to read and question 
media articles is an important constituent of the statistical literacy needed by 
students when they leave school. Learning to question sampling procedures 
as presented by the media is an important part of this ability. Its importance 
is recognised in the Australian National Statement (AEC, 1991) in a specific 
activity for students, "Discuss and make judgments about arguments and 
claims in the media for which statistical information is presented (e.g. 
claiming that 40% of the community think that the school leaving age should 
be raised on the basis of a telephone 'ring-in' poll)" (p. 172). 

The results of this study suggest that more research is needed into 
intervention programs that seek to improve students' understanding of 
sampling and associated bias within the chance and data curriculum. The 
use of student interviews, both initially (e.g., Watson & Moritz, 2000b), and 
longitudinally (e.g., Watson, 2004), to supplement information from surveys, 
is likely to assist in the further development of materials and teaching 
techniques to improve understanding. Carrying out interviews during the 
teaching intervention itself is another potential aid, along with greater 
liaison with teachers during this time. Results of this study suggest that the 
focused interaction of researchers, teachers, and students during a planned 
intervention is likely to produce the greatest benefit in relation to long-term 
outcomes. 